
    Probabilistic models for CRISPR spacer content evolution

    BACKGROUND: The CRISPR/Cas system is known to act as an adaptive and heritable immune system in Eubacteria and Archaea. Immunity is encoded in an array of spacer sequences. Each spacer can provide specific immunity to invasive elements that carry the same or a similar sequence. Even in closely related strains, spacer content is very dynamic and evolves quickly. Standard models of nucleotide evolution cannot be applied to quantify its rate of change since processes other than single nucleotide changes determine its evolution. METHODS: We present probabilistic models that are specific for spacer content evolution. They account for the different processes of insertion and deletion. Insertions can be constrained to occur on one end only or are allowed to occur throughout the array. One deletion event can affect one spacer or a whole fragment of adjacent spacers. Parameters of the underlying models are estimated for a pair of arrays by maximum likelihood using explicit ancestor enumeration. RESULTS: Simulations show that parameters are well estimated on average under the models presented here. There is a bias in the rate estimation when including fragment deletions. The models also estimate times between pairs of strains. But with increasing time, spacer overlap goes to zero, and thus there is an upper bound on the distance that can be estimated. Spacer content similarities are displayed in a distance based phylogeny using the estimated times. We use the presented models to analyze different Yersinia pestis data sets and find that the results among them are largely congruent. The models also capture the variation in diversity of spacers among the data sets. A comparison of spacer-based phylogenies and Cas gene phylogenies shows that they resolve very different time scales for this data set. 
CONCLUSIONS: The simulations and data analyses show that the presented models are useful for quantifying spacer content evolution and for displaying spacer content similarities of closely related strains in a phylogeny. This allows for comparisons of different CRISPR arrays or for comparisons between CRISPR arrays and nucleotide substitution rates

    Motif depletion in bacteriophages infecting hosts with CRISPR systems

    BACKGROUND: CRISPR is a microbial immune system likely to be involved in host-parasite coevolution. It functions using target sequences encoded by the bacterial genome, which interfere with invading nucleic acids using a homology-dependent system. The system also requires protospacer associated motifs (PAMs), short motifs close to the target sequence that are required for interference in CRISPR types I and II. Here, we investigate whether PAMs are depleted in phage genomes due to selection pressure to escape recognition. RESULTS: To this end, we analyzed two data sets. Phages infecting all bacterial hosts were analyzed first, followed by a detailed analysis of phages infecting the genus Streptococcus, where PAMs are best understood. We use two different measures of motif underrepresentation that control for codon bias and the frequency of submotifs. We compare phages infecting species with a particular CRISPR type to those infecting species without that type. Since only known PAMs were investigated, the analysis is restricted to CRISPR types I-C and I-E and in Streptococcus to types I-C and II. We found evidence for PAM depletion in Streptococcus phages infecting hosts with CRISPR type I-C, in Vibrio phages infecting hosts with CRISPR type I-E and in Streptococcus thermophilus phages infecting hosts with type II-A, known as CRISPR3. CONCLUSIONS: The observed motif depletion in phages with hosts having CRISPR can be attributed to selection rather than to mutational bias, as mutational bias should affect the phages of all hosts. This observation implies that the CRISPR system has been efficient in the groups discussed here. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/1471-2164-15-663) contains supplementary material, which is available to authorized users.

    Postprocessing phylogenies

    More and more phylogenetic trees are generated, and it frequently occurs that the inferred relationships contradict each other. In this case, tools are necessary which evaluate the amount of difference between two trees, extract the congruencies of two trees, and combine multiple trees by minimizing the incongruencies. These tools are summarized by the term "phylogenetic postprocessing". In this thesis, two aspects of phylogenetic postprocessing are investigated in detail. First, tree distance computations evaluate the amount of difference between two trees. Most measures only take the topological information into account. There are a few measures that additionally focus on the branch lengths of the trees. One of these is the length of the shortest path in the space of weighted trees, also known as the geodesic distance. Here, an exact, but exponential-time, algorithm to compute the geodesic distance is presented. 
Comparisons with its approximations show that there is a particular path that approximates the geodesic distance well and that can be computed in linear time. Phylogenetic trees can also be tested for being statistically similar or different. Then a topological distance measure can be used as a test statistic where the associated p-value is computed under a null distribution of trees. Discrete tests must ensure that the size of the test is conservative, i.e. the size must not exceed the significance level. We present one example where a test has to be modified to ensure this property. Second, gene trees on overlapping taxon sets can be combined into a so-called supertree. Another possibility is to combine the gene alignments directly, namely, to concatenate the gene alignments into a superalignment and to reconstruct a phylogeny from this long alignment. There is also the possibility to combine the data at a level between superalignment and supertree methods. Simulations of gene alignments along model gene trees allow for the comparison of methods from all three levels. We investigate different settings, e.g. complete or overlapping taxon sets, equal or different substitution parameters or different gene topologies. The results show a good performance of matrix representation methods compared to other supertree and medium-level methods. Furthermore, superalignment is well applicable in the case of differing parameters between genes but is problematic when a high level of incongruence is present among the true gene trees. In addition to the practical evaluation of supertree methods, theoretical and algorithmic aspects are of interest. Therefore we study different null models underlying supertree reconstruction. We find only the distribution of equally likely splits to behave in an appropriate way if little information is present. In contrast, the distribution of equally likely trees inserts a tree shape bias in split-based supertree methods. 
This bias can be traced back to the unequal split distribution in the null model. Finally, a supertree can also be defined by minimizing the total distance to the trees in the set, i.e. as a median tree. The majority-rule consensus is described as a median tree method for trees on the same taxon set. For trees on overlapping taxon sets, however, different specifications can be used, namely MR(-)supertrees and MR(+)supertrees. We present algorithms to compute the respective distances in the matrix representation framework. Applying their implementation to simulated data sets shows a clearly better performance of MR(-) compared to MR(+). This discrepancy is likely to trace back to a tree shape bias in MR(+). To conclude, we see that the two aspects of phylogenetic postprocessing, tree distances and tree combination methods, are not independent. Instead, they are linked by the definition of the median tree. Thus our understanding of tree distances influences data combination methods and vice versa.
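
The link between tree distances and median trees mentioned in this abstract can be illustrated with a minimal sketch. It is not the thesis's implementation and it covers only the simplest case: trees on the same taxon set, compared with the symmetric difference (Robinson-Foulds) distance on their non-trivial splits. The taxon set, the three example trees and all function names are assumptions made for illustration. The MR(-) and MR(+) supertree variants for overlapping taxon sets are not covered here.

```python
from collections import Counter

TAXA = frozenset("ABCDE")  # hypothetical taxon set, for illustration only


def normalize(side, taxa=TAXA):
    """Store a bipartition as the side that excludes a fixed reference taxon."""
    side = frozenset(side)
    return taxa - side if min(taxa) in side else side


def rf_distance(splits_a, splits_b):
    """Symmetric difference (Robinson-Foulds) distance between two split sets."""
    return len(splits_a ^ splits_b)


def majority_rule_splits(trees):
    """Splits that occur in more than half of the input trees."""
    counts = Counter(split for tree in trees for split in tree)
    return {split for split, n in counts.items() if n > len(trees) / 2}


def total_distance(candidate, trees):
    """Sum of distances from a candidate tree to every input tree."""
    return sum(rf_distance(candidate, tree) for tree in trees)


# Three made-up trees, each given by its non-trivial splits.
tree1 = {normalize("AB"), normalize("ABC")}
tree2 = {normalize("AB"), normalize("ABD")}
tree3 = {normalize("AB"), normalize("DE")}
trees = [tree1, tree2, tree3]

consensus = majority_rule_splits(trees)
print(sorted("".join(sorted(s)) for s in consensus))
print(total_distance(consensus, trees))
```

In this toy example the majority-rule split set has a total distance to the inputs that is no larger than that of any single input tree, which is the median-tree property in miniature.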

    Accuracy of phylogeny reconstruction methods combining overlapping gene data sets

    BACKGROUND: The availability of many gene alignments with overlapping taxon sets raises the question of which strategy is the best to infer species phylogenies from multiple gene information. Methods and programs abound that use the gene alignment in different ways to reconstruct the species tree. In particular, different methods combine the original data at different points along the way from the underlying sequences to the final tree. Accordingly, they are classified into superalignment, supertree and medium-level approaches. Here, we present a simulation study to compare different methods from each of these three approaches. RESULTS: We observe that superalignment methods usually outperform the other approaches over a wide range of parameters including sparse data and gene-specific evolutionary parameters. In the presence of high incongruency among gene trees, however, other combination methods show better performance than the superalignment approach. Surprisingly, some supertree and medium-level methods exhibit, on average, worse results than a single gene phylogeny with complete taxon information. CONCLUSIONS: For some methods, using the reconstructed gene tree as an estimation of the species tree is superior to the combination of incomplete information. Superalignment usually performs best since it is less susceptible to stochastic error. Supertree methods can outperform superalignment in the presence of gene-tree conflict.
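
The superalignment approach named in this abstract concatenates per-gene alignments into one long alignment. A minimal sketch of that step is given below, under stated assumptions: each gene alignment is a non-empty dictionary from taxon name to aligned sequence, and a taxon missing from a gene is padded with the gap character "-". The function name, the padding character and the example data are hypothetical and are not taken from the study.

```python
def concatenate_alignments(gene_alignments, missing="-"):
    """Concatenate gene alignments (taxon -> sequence) into a superalignment.

    Taxa absent from a gene are padded with the `missing` character
    (an assumed convention for this sketch).
    """
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    parts = {taxon: [] for taxon in taxa}
    for aln in gene_alignments:
        length = len(next(iter(aln.values())))
        if any(len(seq) != length for seq in aln.values()):
            raise ValueError("sequences within one alignment must have equal length")
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, missing * length))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Two made-up gene alignments on overlapping taxon sets.
genes = [
    {"A": "ACGT", "B": "ACGA", "C": "ACTA"},
    {"B": "GG", "C": "GA", "D": "TA"},
]
superalignment = concatenate_alignments(genes)
for taxon, seq in superalignment.items():
    print(taxon, seq)
```

The padded columns show why sparse taxon coverage matters: taxa A and D share no informative column in this toy data set.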

    A stable backbone for the fungi

    Fungi are abundant in the biosphere. They have fascinated mankind as far as written history goes and have considerably influenced our culture. In biotechnology, cell biology, genetics, and life sciences in general, fungi constitute relevant model organisms. Once the phylogenetic relationships of fungi are stably resolved, individual results from fungal research can be combined into a holistic picture of biology. However, and despite recent progress, the backbone of the fungal phylogeny is not yet fully resolved. Especially the early evolutionary history of fungi and the order or below-order relationships within the ascomycetes remain uncertain. Here we present the first phylogenomic study for a eukaryotic kingdom that merges all publicly available fungal genomes and expressed sequence tags (EST) to build a data set comprising 128 genes and 146 taxa. The resulting tree provides a stable phylogenetic backbone for the fungi. Moreover, we present the first formal supertree based on 161 fungal taxa and 128 gene trees. The combined evidence from the trees supports the deep-level stability of the fungal groups towards a comprehensive natural system of the fungi. They indicate that the classification of the fungi, especially their alliance with the Microsporidia, requires careful revision. Our analysis is also an inventory of present day sequence information for the fungi. It provides insights into which phylogenetic conclusions can and which cannot be drawn from the current data and may serve as a guide to direct further sequencing initiatives. Together with a comprehensive animal phylogeny, we provide the second of three pillars to understand the evolution of the multicellular eukaryotic kingdoms, fungi, metazoa, and plants, in the past 1.6 billion years.

    Complete genome sequence of the novel phage MG-B1 infecting Bacillus weihenstephanensis

    Here, we describe a novel virulent bacteriophage that infects Bacillus weihenstephanensis, isolated from soil in Austria. It is the first phage to be discovered that infects this species. Here, we present the complete genome sequence of this podovirus

    Co-transfer of functionally interdependent genes contributes to genome mosaicism in lambdoid phages

    Lambdoid (or Lambda-like) phages are a group of related temperate phages that can infect Escherichia coli and other gut bacteria. A key characteristic of these phages is their mosaic genome structure, which served as the basis for the "modular genome hypothesis". Accordingly, lambdoid phages evolve by transferring genomic regions, each of which constitutes a functional unit. Nevertheless, it is unknown which genes are preferentially transferred together and what drives such co-transfer events. Here we aim to characterize genome modularity by studying co-transfer of genes among 95 distantly related lambdoid (pro-)phages. Based on gene content, we observed that the genomes cluster into twelve groups, which are characterized by a highly similar gene content within the groups and highly divergent gene content across groups. Highly similar proteins can occur in genomes of different groups, indicating that they have been transferred. About 26% of homologous protein clusters in the four known operons (i.e., the early left, early right, immunity, and late operon) engage in gene transfer, which affects all operons to a similar extent. We identified pairs of genes that are frequently co-transferred and observed that these pairs tend to be in close proximity to one another on the genome. We find that frequently co-transferred genes are involved in related functions and highlight interesting examples involving structural proteins, the CI repressor and Cro regulator, proteins interacting with DNA, and membrane-interacting proteins. We conclude that epistatic effects, where the functioning of one protein depends on the presence of another, play an important role in the evolution of the modular structure of these genomes.

    A Consistent Phylogenetic Backbone for the Fungi

    The kingdom of fungi provides model organisms for biotechnology, cell biology, genetics, and life sciences in general. Only when their phylogenetic relationships are stably resolved, can individual results from fungal research be integrated into a holistic picture of biology. However, and despite recent progress, many deep relationships within the fungi remain unclear. Here, we present the first phylogenomic study of an entire eukaryotic kingdom that uses a consistency criterion to strengthen phylogenetic conclusions. We reason that branches (splits) recovered with independent data and different tree reconstruction methods are likely to reflect true evolutionary relationships. Two complementary phylogenomic data sets based on 99 fungal genomes and 109 fungal expressed sequence tag (EST) sets analyzed with four different tree reconstruction methods shed light from different angles on the fungal tree of life. Eleven additional data sets address specifically the phylogenetic position of Blastocladiomycota, Ustilaginomycotina, and Dothideomycetes, respectively. The combined evidence from the resulting trees supports the deep-level stability of the fungal groups toward a comprehensive natural system of the fungi. In addition, our analysis reveals methodologically interesting aspects. Enrichment for EST encoded data—a common practice in phylogenomic analyses—introduces a strong bias toward slowly evolving and functionally correlated genes. Consequently, the generalization of phylogenomic data sets as collections of randomly selected genes cannot be taken for granted. A thorough characterization of the data to assess possible influences on the tree reconstruction should therefore become a standard in phylogenomic analyses

    Characterization of Blf4, an Archaeal Lytic Virus Targeting a Member of the Methanomicrobiales

    Today, the number of known viruses infecting methanogenic archaea is limited. Here, we report on a novel lytic virus, designated Blf4, and its host strain Methanoculleus bourgensis E02.3, a methanogenic archaeon belonging to the Methanomicrobiales, both isolated from a commercial biogas plant in Germany. The virus consists of an icosahedral head 60 nm in diameter and a long non-contractile tail of 125 nm in length, which is consistent with the new isolate belonging to the Siphoviridae family. Electron microscopy revealed that Blf4 attaches to the vegetative cells of M. bourgensis E02.3 as well as to cellular appendages. Apart from M. bourgensis E02.3, none of the tested Methanoculleus strains were lysed by Blf4, indicating a narrow host range. The complete 37 kb dsDNA genome of Blf4 contains 63 open reading frames (ORFs), all organized in the same transcriptional direction. For most of the ORFs, potential functions were predicted. In addition, the genome of the host M. bourgensis E02.3 was sequenced and assembled, resulting in a 2.6 Mbp draft genome consisting of nine contigs. All genes required for a hydrogenotrophic lifestyle were predicted. A CRISPR/Cas system (type I-U) was identified with six spacers directed against Blf4, indicating that this defense system might not be very efficient in fending off invading Blf4 virus